Predictive model for identifying mild cognitive impairment in patients with type 2 diabetes mellitus: A CHAID decision tree analysis

Abstract Background As the population ages, mild cognitive impairment (MCI) and type 2 diabetes mellitus (T2DM) become common conditions that often coexist. Evidence has shown that MCI could lead to reduced treatment compliance, medication management, and self‐care ability in T2DM patients. Therefore, early identification of those with increased risk of MCI is crucial from a preventive perspective. Given the growing utilization of decision trees in prediction of health‐related outcomes, this study aimed to identify MCI in T2DM patients using the decision tree approach. Methods This hospital‐based case–control study was performed in the Endocrinology Department of Xiangya Hospital affiliated to Central South University between March 2021 and December 2022. MCI was defined based on the Petersen criteria. Demographic characteristics, lifestyle factors, and T2DM‐related information were collected. The study sample was randomly divided into the training and validation sets in a 7:3 ratio. Univariate and multivariate analyses were performed, and a decision tree model was established using the chi‐square automatic interaction detection (CHAID) algorithm to identify key predictor variables associated with MCI. The area under the curve (AUC) value was used to evaluate the performance of the established decision tree model, and the performance of multivariate regression model was also evaluated for comparison. Results A total of 1001 participants (705 in the training set and 296 in the validation set) were included in this study. The mean age of participants in the training and validation sets was 60.2  ±  10.3 and 60.4  ±  9.5 years, respectively. There were no significant differences in the characteristics between the training and validation sets (p > .05). The CHAID decision tree analysis identified six key predictor variables associated with MCI, including age, educational level, household income, regular physical activity, diabetic nephropathy, and diabetic retinopathy. 
The established decision tree model had 15 nodes composed of 4 layers, and age was the most significant predictor variable. It performed well (AUC = .75 [95% confidence interval (CI): .71–.78] and .67 [95% CI: .61–.74] in the training and validation sets, respectively), was internally validated, and had comparable predictive value compared to the multivariate logistic regression model (AUC = .76 [95% CI: .72–.80] and .69 [95% CI: .62–.75] in the training and validation sets, respectively). Conclusion The established decision tree model based on age, educational level, household income, regular physical activity, diabetic nephropathy, and diabetic retinopathy performed well with comparable predictive value compared to the multivariate logistic regression model and was internally validated. Due to its superior classification accuracy and simple presentation as well as interpretation of collected data, the decision tree model is recommended for the prediction of MCI in T2DM patients in clinical practice.


INTRODUCTION
Mild cognitive impairment (MCI) and type 2 diabetes mellitus (T2DM) are highly prevalent and often coexist in older adults (Srikanth et al., 2020). MCI is a transitional stage between normal ageing and dementia that is characterized by cognitive dysfunction with minimal impairment in instrumental activities of daily living (Petersen, 2004; Petersen et al., 1999). Accumulated evidence has consistently shown that the presence of MCI in the general population is associated with an increased risk of dementia (Davis et al., 2018; Mitchell & Shiri-Feshki, 2009; Zhang et al., 2021). In addition, MCI in T2DM could lead to reduced treatment compliance, medication management, and self-care ability (Kim & Fritschi, 2021; Verma et al., 2021), which may be explained by the decrements in working memory, learning, executive function, and processing speed observed in MCI (Christman et al., 2010; Santos et al., 2018), although the ability to perform daily activities is essentially preserved. Therefore, the management of MCI in T2DM patients is crucial, and early identification of those with MCI is a starting point.
The decision tree is one of the most commonly used machine learning models in a wide range of medical situations requiring decision-making, and it provides high classification accuracy with a simple representation of gathered knowledge (Bae, 2014; Podgorelec et al., 2002). Compared to other machine learning models, the decision tree has the following advantages: (1) it can be visualized and is simple to understand and interpret in clinical practice; (2) it provides a remarkably transparent decision-making process, allowing deep exploration of features; and (3) due to its high transparency, the decision-making process can be easily validated by an expert, which greatly enhances its utility in situations containing high uncertainty (Amendolara et al., 2023; Bae, 2014; Plante et al., 1986; Podgorelec et al., 2002; Ting Sim et al., 2023). A growing body of literature has demonstrated the effectiveness of decision trees in predicting the occurrence as well as the prognosis of health-related outcomes (Toyoda et al., 2023; Yang et al., 2021; Zhou et al., 2023), and some studies directly compared the decision tree model with common machine learning models to solve prediction problems (Hu et al., 2022; Langenberger et al., 2023). Previous work comparing logistic regression and decision tree models found comparable predictive values (Wentzlof et al., 2019; Zhang et al., 2022). However, as the logistic regression model is more difficult to interpret and understand than the decision tree model, especially for those without experience with this particular model type, the decision tree model is preferred in clinical practice (Fu et al., 2022). In the diabetic population, few studies to date have used decision tree models to predict diabetic comorbidities or complications (Kasbekar et al., 2017; Rinkel et al., 2020; Zhou et al., 2022), and to our knowledge, there is no study on the prediction of MCI in patients with T2DM using the decision tree model. Given the growing utilization of decision trees in the prediction of health-related outcomes and the negative effects of MCI on the prognosis of T2DM patients, this study aimed to identify MCI in T2DM patients using the decision tree approach.

METHODS
This study adhered to the principles of the Declaration of Helsinki. The protocol of this study was reviewed and approved by the Ethics Committee of the Xiangya School of Public Health, Central South University.
Before enrollment, written informed consent was obtained from each participant.

Participants
The inclusion criteria for the participants were as follows: (1) a clini- [...] The exclusion criteria were as follows: (1) repeated hospitalizations during the study period; (2) dementia; and (3) inability to speak or understand Mandarin.
Eligible participants who met the diagnostic criteria for MCI were considered cases, whereas those who did not meet the criteria were considered controls with normal cognitive function.

Outcome of interest
The outcome of this study was MCI, which was defined by physicians based on the Petersen criteria (Petersen, 2004).

Data collection
Demographic characteristics and lifestyle factors were obtained retrospectively through face-to-face interviews by well-qualified investigators. T2DM-related information was extracted from medical records.
All investigators underwent unified training and were blinded to the cognitive status of the participants.

Predictor variables
The

Statistical analyses
The study sample was randomly divided into the training and validation sets in a 7:3 ratio. The training set was used to develop the decision tree model, and the validation set was used to validate the decision tree model internally. Because excluding all observations with missing values could induce substantial bias as well as loss of efficiency, missing data were filled using random forest interpolation (Salgado et al., 2016; Fox-Wasylyshyn & El-Masri, 2005).
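As an illustration of the 7:3 random split described above, the following is a minimal Python sketch using only the standard library. It is not the authors' procedure, which was carried out in SPSS or R and is not described in detail; the function name `split_train_validation`, the seed, and the use of integer participant identifiers are assumptions made for the example. Note that a strict 70% cut of 1001 participants gives 701 and 300, whereas the study reports 705 and 296, so the authors' exact split rule evidently differed.

```python
import random


def split_train_validation(ids, train_fraction=0.7, seed=0):
    """Randomly split participant identifiers into training and validation sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers 0..1000 standing in for the 1001 participants.
train_ids, validation_ids = split_train_validation(range(1001))
print(len(train_ids), len(validation_ids))
```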
Continuous variables were reported as mean ± standard deviation or median and interquartile range as appropriate, and categorical variables were reported as frequency (n) and proportion (%). Chi-square test or Fisher's exact test was used to compare the categorical predictor variables in the training set as appropriate, and predictor variables that differed significantly were entered into the multivariate logistic regression model.
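The two univariate tests named above can be sketched for a single binary predictor as follows. This is an illustrative Python version written with the standard library only, not the SPSS or R routines used in the study; the function names and the example counts are hypothetical. The chi-square version is the Pearson statistic without continuity correction, and the Fisher version uses the common two-sided definition that sums the probabilities of all tables no more likely than the observed one.

```python
from math import comb, erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = erfc(sqrt(statistic / 2))  # upper tail of chi-square with 1 df
    return statistic, p_value


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so tables tied with the observed one are included.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= observed * (1 + 1e-9))


# Hypothetical counts: rows = predictor absent/present, columns = control/case.
statistic, p_value = chi_square_2x2(30, 70, 60, 40)
print(round(statistic, 2), p_value < 0.05)
print(round(fisher_exact_2x2(3, 1, 1, 3), 4))
```

A common rule of thumb is to prefer Fisher's exact test when any expected cell count is below 5; whether the authors applied this rule is not stated.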
The chi-square automatic interaction detection (CHAID) algorithm, whose main purpose is to identify key factors related to the outcomes of interest, was used to develop the decision tree model (Kass, 1980). In this algorithm, homogeneous groups can be constructed by any possible combination of the known values of a predictor variable.
The number of predictor variables for creating the decision tree model depends on the values of the chi-square test and whether the differences are statistically significant or not. There are three types of nodes in a decision tree: (1) a root node, representing a choice that will induce the subdivision of all records into two or more mutually exclusive subsets; (2) internal nodes, representing one of the possible choices that are available at that point in the tree structure; and (3) leaf nodes, representing the final result of a combination of decisions (Song & Lu, 2015). The node containing all the cases is considered the root node in the CHAID algorithm, and the predictor variable with the largest chi-square value divides the entire sample into at least two subgroups, which are subsequently split by the next most significant predictor variable. The analysis continues in this stepwise way to choose the next most significant predictor variable until there are no more significant predictor variables.
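The stepwise splitting described above can be sketched in Python as follows. This is a deliberately simplified, CHAID-style illustration and not the SPSS implementation used in the study: it assumes every predictor and the outcome are coded 0/1, and it omits the category merging and Bonferroni-adjusted p values of the full algorithm (Kass, 1980). The function names, the `max_depth` parameter, and the toy data are hypothetical.

```python
from math import erfc, sqrt


def chi_square_binary(rows, predictor, outcome):
    """Pearson chi-square and p value (1 df) for two binary (0/1) columns."""
    counts = [[0, 0], [0, 0]]
    for row in rows:
        counts[row[predictor]][row[outcome]] += 1
    (a, b), (c, d) = counts
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:  # an empty margin means no split is possible
        return 0.0, 1.0
    statistic = (a + b + c + d) * (a * d - b * c) ** 2 / denominator
    return statistic, erfc(sqrt(statistic / 2))


def grow_tree(rows, predictors, outcome, alpha=0.05, depth=0, max_depth=3):
    """Split each node on the predictor with the largest significant chi-square."""
    node = {"n": len(rows), "cases": sum(row[outcome] for row in rows)}
    if depth == max_depth or not predictors:
        return node  # leaf node: depth limit reached or no predictors left
    scored = [(chi_square_binary(rows, p, outcome), p) for p in predictors]
    (statistic, p_value), best = max(scored, key=lambda item: item[0][0])
    if p_value >= alpha:
        return node  # leaf node: no remaining predictor is significant
    remaining = [p for p in predictors if p != best]
    node["split_on"] = best
    node["children"] = {
        level: grow_tree([row for row in rows if row[best] == level],
                         remaining, outcome, alpha, depth + 1, max_depth)
        for level in (0, 1)
    }
    return node


def make_rows(older, active, n, cases):
    """Toy records: the first `cases` of `n` rows are MCI cases."""
    return [{"older": older, "active": active, "mci": int(i < cases)}
            for i in range(n)]


rows = (make_rows(1, 0, 20, 18) + make_rows(1, 1, 20, 12)
        + make_rows(0, 0, 20, 8) + make_rows(0, 1, 20, 2))
tree = grow_tree(rows, ["older", "active"], "mci")
print(tree["split_on"], tree["children"][1]["split_on"])
```

In this toy data the root node holds all 80 records, the first split produces two internal nodes, and the four terminal subsets are leaf nodes, mirroring the three node types listed above.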
The performance of the decision tree model and multivariate logistic regression model was evaluated by the receiver operating characteristic (ROC) curves and the area under the curve (AUC). All statistical analyses were two-sided and performed using the IBM SPSS software (version 26.0) or R software (version 4.2.1; https://www.r-project.org/).
A p value of <.05 was considered statistically significant.
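To make the AUC evaluation concrete, the sketch below computes the AUC as the proportion of case-control pairs in which the case receives the higher predicted score (ties counted as one half), together with a percentile bootstrap confidence interval. It is an illustrative Python version under stated assumptions: the study does not say how its 95% CIs were obtained, so the bootstrap is a stand-in, and the function names, seed, and example scores are hypothetical. For a decision tree, the score of a participant could be taken as the proportion of cases in the leaf node the participant falls into.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def bootstrap_ci(scores, labels, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        if 0 < sum(resampled_labels) < n:  # both classes must be present
            estimates.append(auc([scores[i] for i in idx], resampled_labels))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * n_boot)]
    upper = estimates[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical predicted scores and true labels (1 = MCI, 0 = control).
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]
labels = [0, 0, 1, 1, 0, 1, 1, 0]
print(auc(scores, labels))
print(bootstrap_ci(scores, labels))
```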

Univariate analyses of factors associated with MCI in the training set
The univariate analyses of factors associated with MCI in the training set are shown in Table 2. Age, sex, marital status, educational level, household income, location of residence, primary caregiver, current work status, current smoker, current drinker, regular physical activity, duration of diabetes, stroke, hypertension, coronary heart disease, fatty liver, diabetic nephropathy, diabetic retinopathy, and diabetic foot differed significantly between the cases and controls in the training set (p < .05).

Validation of the decision tree model
The ROC curves for the decision tree models and multivariate logistic regression models in the training and validation sets are shown in Figure 2. The AUC value of the decision tree models was .75 (95% CI: .71-.78) and .67 (95% CI: .61-.74) in the training and validation sets, respectively, and that of the multivariate logistic regression models was .76 (95% CI: .72-.80) and .69 (95% CI: .62-.75) in the training and validation sets, respectively.

DISCUSSION
This study developed a decision tree model to help identify MCI in patients with T2DM by comprehensively taking demographic characteristics, lifestyle factors, and T2DM-related information into account.
To the best of our knowledge, this was the first study attempting to use the CHAID decision tree analysis to identify MCI in patients with T2DM. Previous studies that directly compared the performance of conventional logistic regression and decision tree models to solve prediction problems found comparable results. For example, Kuang et al. (2021) found that both logistic regression and decision tree models performed well at predicting the transition from MCI to Alzheimer's disease with ideal stability; Yu et al. (2024) found that the AUC value was .868 (95% CI: .821-.916) and .863 (95% CI: .814-.912) for the logistic regression and decision tree models to identify suicidal ideation in schizophrenia patients, respectively; and similar prediction accuracy in these two approaches was also observed in a study using the self-reported clinical symptoms of dengue fever to predict potential dengue infection (Khosavanna et al., 2021). Compared to logistic regression, the main advantages of the decision tree model are easy visualization and simplicity; the results are transformed into a set of decision rules that are similar to clinical reasoning; and there is an intuitive and straightforward explanation of how the decision was made. This study added to the existing body of knowledge by indicating that both logistic regression and decision tree models performed well at predicting MCI in patients with T2DM. Additionally, the decision tree model established in this study identified 6 key predictor variables with 15 nodes composed of 4 layers. It was simple and easy to understand, without issues of multiple levels or nonrelevant splits. Therefore, the use of the decision tree model to identify MCI in T2DM patients is suggested in clinical practice considering its superior classification accuracy and simple presentation as well as interpretation of collected data.
In terms of the key predictor variables identified by the decision tree model, age, educational level, household income, and regular physical activity were well-known factors that were frequently observed by previous work in the general population (Biessels et al., 2008; Dominguez et al., 2021; Jia et al., 2020; Zhang et al., 2019), whereas diabetic nephropathy and retinopathy were factors specific to the diabetic population. Age was the most significant predictor variable for the identification of MCI in T2DM patients, and the increased risk of MCI in those aged ≥60 observed in this study could be explained by the brain structure changes and decreased brain functioning that occur with increased age (Mankovsky et al., 2018). Educational level has been believed to be the strongest noncognitive factor affecting cognitive function (Fan et al., 2021; Pedraza et al., 2017), and the protective effects of high household income against MCI in T2DM may be attributed to higher social engagement and more social resources (Coughlin, 2020), as well as the ability to tolerate higher levels of neuropathology, which would be beneficial for the maintenance of cognitive function (Pais et al., 2021). Additionally, the role of exercise in maintaining cognitive function has been well documented (Donnelly et al., 2016; Houston et al., 2018), and this study contributed to the existing knowledge by supporting the protective effects of regular physical activity against MCI in patients with T2DM.
Diabetic nephropathy and retinopathy are the most prevalent microvascular complications of T2DM (Seewoodhary, 2021). Previous work has linked these two complications with a wide range of poor health-related outcomes including reduced quality of life and increased risk of depression, anxiety, and bipolar disorder (Chen et al., 2023; Edalat-Nejad et al., 2014; Khoo et al., 2019; Mahobia et al., 2021; Valluru et al., 2023). In addition, this study found that diabetic nephropathy and retinopathy were both associated with a higher risk of MCI. Potential mechanisms underlying the association between diabetic nephropathy and cognitive impairment include vascular dysfunction, lymphatic dysfunction, decreased clearance of uremic toxins, and hemodynamic changes during dialysis, which could lead to cognitive decline (Drew et al., 2019). Additionally, the impacts of diabetic retinopathy on cognitive function may be explained by shared pathways, including neuroinflammation and degeneration, vascular degeneration, and glial activation (Little et al., 2022). Therefore, the findings of this study strongly highlighted the importance of strengthening the management of diabetic complications including diabetic nephropathy and retinopathy for maintaining cognitive function of T2DM patients in clinical practice.
In conclusion, this study developed and validated a decision tree model to help identify MCI in T2DM patients in clinical practice by employing a large sample size. The established decision tree model based on age, educational level, household income, regular physical activity, diabetic nephropathy, and diabetic retinopathy performed well with comparable predictive value compared to the multivariate logistic regression model and was internally validated.
However, some limitations need to be acknowledged. First, this study was hospital-based. Therefore, whether the findings of this study can be generalized to those from community settings remains unclear. Second, though internally well validated, there was still a lack of external validation for the findings of this study. Third, this study only compared the performance of the decision tree model with logistic regression, and it remains a possibility that other machine learning models, such as random forest, gradient boosting machine, and artificial neural network, could perform better than the decision tree model for the identification of MCI in T2DM patients. Therefore, future community-based studies with comparisons between various machine learning models are warranted.
The established decision tree model.
FIGURE 2 The receiver operating characteristic curves for the decision tree and multivariate logistic regression models: a, multivariate logistic regression model in the training set; b, decision tree model in the training set; c, multivariate logistic regression model in the validation set; d, decision tree model in the validation set; e, reference line.
The principal finding of this study was that, compared to the multivariate logistic regression model, the established decision tree model had comparable predictive value and was well internally validated.
TABLE 1 Characteristics of the study participants.
Abbreviations: BMI, body mass index; HbA1c, glycated hemoglobin A1c; RMB, renminbi.
a Indicated there were 45 missing values of this variable, which were filled using the random forest interpolation.
TABLE 2 Univariate analyses of factors associated with MCI in the training set.
Abbreviations: BMI, body mass index; HbA1c, glycated hemoglobin A1c; MCI, mild cognitive impairment; RMB, renminbi.
TABLE 3 Multivariate analysis of factors associated with mild cognitive impairment (MCI) in the training set.